Consider a node with a failed reusable replica as still used #2650

ejweber · 2024-02-26T22:47:50Z

Which issue(s) this PR fixes:

What this PR does / why we need it:

When replicaNodeSoftAntiAffinity == false, the scheduler should not schedule a second replica to a node that has a failed replica of the same volume. When replicaNodeSoftAntiAffinity == true, it should.

When replicaNodeSoftAntiAffinity == false and there is already a failed replica for a volume scheduled to a node, the scheduler should ONLY consider that node for scheduling if:

The failed replica is no longer usable (e.g. spec.rebuildRetryCount >= 5), or
replica-replenishment-wait-interval is exceeded.

ejweber · 2024-02-27T21:47:26Z

To test:

Follow the "Observe the root cause" steps from longhorn/longhorn#8043 (comment). I don't recommend trying to follow the "Cause a lockup" steps, because the reproducibility is low.

Watch the replicas while the node reboots. At no point is any other replica scheduled to the node. The end result is a volume with one running replica and two unscheduled replicas. The two unscheduled replicas are different from the ones we started with due to various replica cleanup and replenishment behaviors.

eweber@laptop:~/longhorn> kl get replica -w --output-watch-events
EVENT      NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   18m
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-68e6ec16   v1            stopped                                                                                                                                                                           5m48s
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-7b801907   v1            stopped                                                                                                                                                                           5m47s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            error     eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b                                            18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            error     eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b                                            18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-68e6ec16   v1            stopped                                                                                                                                                                           6m41s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-68e6ec16   v1            stopped                                                                                                                                                                           6m41s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-7b801907   v1            stopped                                                                                                                                                                           6m40s
DELETED    pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-68e6ec16   v1            stopped                                                                                                                                                                           6m41s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-7b801907   v1            stopped                                                                                                                                                                           6m40s
DELETED    pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-7b801907   v1            stopped                                                                                                                                                                           6m40s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            stopped   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f                                                                                                18m
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   18m
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f81f51f8   v1                                                                                                                                                                                              0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f81f51f8   v1                                                                                                                                                                                              0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   19m
ADDED      pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f71dc284   v1                                                                                                                                                                                              0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f71dc284   v1                                                                                                                                                                                              0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f81f51f8   v1            stopped                                                                                                                                                                           0s
MODIFIED   pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f71dc284   v1            stopped                                                                                                                                                                           0s

eweber@laptop:~/longhorn> kl get replica
NAME                                                  DATA ENGINE   STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-64fed43c   v1            running   eweber-v126-worker-9c1451b4-kgxdq   5680b199-91bd-452e-bbb6-4eeee965bf2f   instance-manager-699da83c0e9d22726e667344227e096b   longhornio/longhorn-engine:master-head   22m
pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f71dc284   v1            stopped                                                                                                                                                                           3m43s
pvc-ecf181b7-8520-4dc8-bfb5-1cb5c7c044d0-r-f81f51f8   v1            stopped                                                                                                                                                                           3m43s

ejweber · 2024-02-28T16:43:03Z

This PR is currently problematic because it fails e2e tests like test_single_replica_failed_during_engine_start. That test sets replica-replenishment-wait-interval = 0 and assumes that a new replica can be immediately spun up to replace the failed one. This PR does not allow the new replica to schedule to the existing node with a failed replica. (Everything will unblock after the failed replica attempts to rebuild five times and is deleted, but the way the docs for replica-replenishment-wait-interval are worded, we should NOT wait that long.)

ejweber · 2024-03-04T17:02:54Z

After discussing it with @PhanLe1010 and @james-munson, we decided the node should be considered used only if the failed replica is potentially reusable and replica-replenishment-wait-interval has not been exceeded. This avoids breaking existing expected behavior (we will schedule another replica immediately after replica-replenishment-wait-interval, even if it is to a node containing a failed replica). With the changes, this PR is still effective in the original test case. However, the changes may limit the usefulness of this PR in some situations, as it is likely possible to manufacture a scenario in which we can still hit the issue. For example:

The volume has two replicas (instead of one).
The node that restarts is NOT the node running the engine. The volume becomes degraded instead of faulted.
The node takes longer than replica-replenishment-wait-interval to come back.
When it comes back, Longhorn schedules the third replica to the node, even though it already contains the second one.

I would prefer to avoid the above potential scenario. However, it should not result in a lockup like the original test case. I think it is better to maintain the existing behavior of scheduling replicas to nodes with existing failed replicas and to treat this PR as a best-effort fix.

shuo-wu

I didn't check the test part but the implementation LGTM

ejweber · 2024-03-05T22:18:12Z

Three failures in https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6587/.

PhanLe1010 · 2024-03-06T01:01:01Z

The general idea LGTM. Sorry, I cannot review this PR in detail due to time pressure on other tasks. I will defer to @shuo-wu and @james-munson to drive the review.

james-munson

Generally LGTM. Just a couple of small questions, but overall this is both clearer and more capable.

scheduler/replica_scheduler.go

ejweber · 2024-03-07T16:52:45Z

@mergify backport v1.6.x

mergify · 2024-03-07T16:53:03Z

backport v1.6.x

✅ Backports have been created

#2677 Consider a node with a failed reusable replica as still used (backport #2650) has been created for branch v1.6.x

ejweber · 2024-03-07T17:19:15Z

@mergify backport v1.5.x

mergify · 2024-03-07T17:19:41Z

backport v1.5.x

✅ Backports have been created

#2679 Consider a node with a failed reusable replica as still used (backport #2650) has been created for branch v1.5.x but encountered conflicts

ejweber force-pushed the 8043-avoid-scheduling-a-second-replica branch 2 times, most recently from cf71863 to b8d0efc February 27, 2024 23:45

ejweber mentioned this pull request Feb 28, 2024

Fix HA tests for scheduling change longhorn/longhorn-tests#1787

Closed

ejweber force-pushed the 8043-avoid-scheduling-a-second-replica branch 3 times, most recently from eaf176c to bb06506 March 1, 2024 22:12

ejweber changed the title ~~Consider a node with a failed replica as still used~~ Consider a node with a failed reusable replica as still used Mar 4, 2024

ejweber marked this pull request as ready for review March 4, 2024 17:02

ejweber requested a review from a team as a code owner March 4, 2024 17:02

ejweber mentioned this pull request Mar 4, 2024

[BUG] A replica may be incorrectly scheduled to a node with an existing failed replica longhorn/longhorn#8043

Closed

shuo-wu previously approved these changes Mar 5, 2024

james-munson requested changes Mar 6, 2024

ejweber dismissed shuo-wu’s stale review via 823e2c3 March 6, 2024 21:46

ejweber force-pushed the 8043-avoid-scheduling-a-second-replica branch from 823e2c3 to e282109 March 6, 2024 21:58

ejweber requested a review from james-munson March 6, 2024 22:05

james-munson approved these changes Mar 6, 2024

ejweber added 3 commits March 6, 2024 16:21

Consider a node with a failed replica as still used

07e9a7f

Only do this for the purposes of scheduling new replicas. Maintain previous behavior when checking for reusable replicas. Longhorn 8043 Signed-off-by: Eric Weber <eric.weber@suse.com>

Refine scheduling behavior with failed replicas

64aa6cb

Consider a node with a failed replica as used if the failed replica is potentially reusable and replica-replenishment-wait-interval hasn't expired. Longhorn 8043 Signed-off-by: Eric Weber <eric.weber@suse.com>

Simplify logic in timeToReplacementReplica

f371f2a

Longhorn 8043 Signed-off-by: Eric Weber <eric.weber@suse.com>

ejweber force-pushed the 8043-avoid-scheduling-a-second-replica branch from e282109 to f371f2a March 6, 2024 22:22

shuo-wu approved these changes Mar 7, 2024

shuo-wu merged commit e685946 into longhorn:master Mar 7, 2024
5 checks passed

mergify bot mentioned this pull request Mar 7, 2024

Consider a node with a failed reusable replica as still used (backport #2650) #2677

Merged

ejweber mentioned this pull request Mar 7, 2024

[BACKPORT][v1.6.1][BUG] A replica may be incorrectly scheduled to a node with an existing failed replica longhorn/longhorn#8044

Closed

mergify bot mentioned this pull request Mar 7, 2024

Consider a node with a failed reusable replica as still used (backport #2650) #2679

Merged

ejweber mentioned this pull request Mar 7, 2024

[BACKPORT][v1.5.5][BUG] A replica may be incorrectly scheduled to a node with an existing failed replica longhorn/longhorn#8116

Closed
